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Abstract 

The classical approach to system identification is based on statistical assumptions about the 

o: 

p^ , measurement error, and provides estimates that have stochastic nature. Worst-case identification, on 

^^. the other hand, only assumes the knowledge of deterministic error bounds, and provides guaranteed 

estimates, thus being in principle better suited for its use in control design. However, a main limitation 



of such deterministic bounds lies on the fact that they often turn out to be overly conservative, thus 
^^ I leading to estimates of limited use. 

r""'. In this paper, we propose a rapprochement between these two paradigms, stochastic and worst- 

rj^ • case, and propose a novel probabiUstic framework for system identification that combines elements 

c/3 ' from information-based complexity with recent developments in the theory of randomized algorithms. 

The main idea in this line of research is to "discard" sets of measure at most e, where e is a 
Cn , probabilistic accuracy, from the set of deterministic estimates. Therefore, we are decreasing the 

0> ■ so-called worst-case radius of information at the expense of a given probabilistic "risk." 

(N ■ 

T+ ' In this setting, we compute a trade-off curve, called violation function, which shows how the 

radius of information decreases as a function of the accuracy. To this end, we construct randomized 

m : 

f^ , and deterministic algorithms which provide approximations of this function. The obtained results are 

CN ■ 

based upon specific properties regarding the intersection of convex sets. 
- -"2 Keywords: System identification and filtering, optimal algorithms, randomized algorithms, uncertain 



H ■ systems 
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I. Introduction and Preliminaries 

Information-based complexity (IBC) is a general theory developed within computer science for 
studying the complexity of problems approximately solved due to the presence of partial and/or con- 
taminated information; see |!52| and ll53l and references therein. Typical applications of IBC include 
distributed computations, clock synchronization, computer vision, numerical integration, solution of 
nonlinear equations and large scale linear systems, as well as system identification and filtering. 
Depending on the specific application, different settings are studied: worst-case, average-case and 
probabilistic. In each setting, the objective is threefold: to derive optimal algorithms, to compute the 
associated approximation errors and to establish their computational complexity. 

In the last decades, several authors focused their attention on the so-called set-membership identi- 
fication which aims at the computation of hard bounds on the estimation eixors, see for instance |53, 
and |[26l for pointers to more recent developments. Set-membership identification may be embedded 
within the general framework of worst-case IBC, so that various systems and control problems, such 
as time-series analysis, filtering and Tioo identification can be addressed ll35l . P9l . ll25l . Q, PTI . In 
this setting, the noise is a deterministic variable bounded within a set of given radius. The objective 
is to derive optimal algorithms which minimize (with respect to the noise) the maximal distance 
between the true-but-unknown system parameters and their estimates. The drawback of the hard 
bound approach is that the estimation errors may be too large in many instances and therefore of 
limited use, in particular when the ultimate objective is to use system identification in the context of 
closed-loop control. 

The worst-case setting provides a counterpart to the mainstream stochastic approach to system 
identification, see |!33l| and the special issues |[34l . P31 . which has been very successful also in many 
applications, such as e.g. process control, systems biology and optimization. This approach assumes 
that the available observations are contaminated by random noise, and has the goal to derive soft 
bounds on the estimation errors. In this case, optimality is guaranteed in a probabilistic sense and 
the resulting algorithms often enjoy convergence properties only asymptotically. 

The worst-case setting is based on the "concern" that the noise may be very malicious. The 
computed bounds are certainly more pessimistic than the stochastic ones, but the idea is to guard 
against the worst-case scenario, even though it is unlikely to occur. These observations lead us to 
discuss the rap/jroc/iemen? viewpoint, see ll38l . ll23l . BOl . lfT2l . lIlTI . which has the following starting 
point: the measurement noise is confined within a given set (and therefore it falls under the framework 
of the worst-case setting), but it is also a random variable with given probability distribution (so that 
statistical information is used). A simple example is uniformly distributed noise with a supporting 
set which is that adopted by the worst-case methods. We recall that the rapprochement approach has 
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been extensively studied in the context of control design in the presence of uncertainty, see Il50l . lITOl . 
ll9l . and Q. This research provides a rigorous methodology for deriving controllers guaranteeing the 
desired performance specifications with high level of probability. 

The focal point of this paper is to address the rapprochement between soft and hard bounds in 
a rigorous fashion, with the goal to derive useful computational tools for system identification and 
filtering. The specific problem formulation we consider is the probabilistic setting of information-based 
complexity and the objective is to compute (by means of randomized and deterministic algorithms) 
the so-called probabilistic radius of information, with emphasis on uniformly distributed noise. We 
remark that, contrary to the statistical setting which mainly concentrates on asymptotic results, the 
probabilistic radius introduced in this paper provides a quantification of the estimation error which 
is based on a finite number of observations. In this sense, this approach has close relations with the 
works based on statistical learning theory proposed in ||29l . ll56l . |[54l . and with the approach in |[T4l . 
lITSl . where noise-free non-asymptotic confidence sets for the estimates are derived. Furthermore, the 
paper is also related to the work lISTI . where a probabilistic density function over the consistency set 
is considered. 

We now provide a preview of the structure and main results of the paper. Section JII] presents 
an introduction to information-based complexity. To summarize, we are interested in computing an 
optimal approximation of Sx € Z where 5 is a given linear mapping S : X ^ Z C W, and 
X £ X C M"; X and S are called, respectively, problem element and solution operator (that is, 
Sx represents a linear combination of the unknown parameters of the system to be identified). The 
element x is not exactly known but only approximate information y = Zx + g is available, where Z 
is called information operator, which is linear, and the noise q is confined within a norm-bounded set 
Q C M™. An approximation to Sx is provided by an algorithm (or estimator) A, generally nonlinear, 
acting on the information y. Optimal algorithms minimize the maximal distance between the true- 
but-unknown solution Sx and the estimated solution A{y) for the worst-case noise q £ Q. The error 
of an optimal algorithm is called the worst-case radius of information. This section also contains 
the formal definition of the consistency set Z~^{y) which plays a major role in this paper. Roughly 
speaking, this is the set of all parameters x which are compatible with the given data y, the model 
y = Zx + q and the noise g G Q. Section |lll] includes two examples demonstrating how system 
parameter identification, prediction and filtering may be formulated in the general IBC framework. 

Section |IV] introduces the probabilistic setting. In this context, the idea is to "discard" sets of 
(probabilistic) measure at most e from the consistency set, having the objective of decreasing the worst- 
case radius, thus obtaining a new error which represents the probabilistic radius of information. In 
other words, the worst-case radius is decreased, possibly significantly, at the expense of a probabilistic 
risk e. To discard "bad sets" of small measure is exactly the main feature of the probabilistic approach 
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to analysis and design of uncertain systems proposed in jSOl . This approach may be very useful, for 
example, for system identification in the presence of outliers Q, where "bad measurements" may be 
discarded. In this section, by means of a chance-constrained approach ||37l , we also show that the 
probabilistic radius is related to the minimization of the so-called optimal violation function Vo{r). 
Roughly speaking, this function describes how the risk e decreases as a function of the radius r. 
However, the computation of Vo{r) is not an easy task and requires the results proved in Section |V] 
and the algorithms presented in Section |Vl] Section JV] ends showing a tutorial example regarding 
estimation of the parameters of a second order model corrupted by additive noise. The example in 
continued in other sections of the paper for illustrative purposes. 

Section |V] deals with uniformly distributed noise and contains the main technical results of the 
paper. Theorem [T] shows that the induced measure (through the inverse of the operator information 
Z) over the consistency set Z~^(y) is uniform and the induced measure of the set SZ~^{y) (which 
is the transformation of the consistency set Z^^{y) through the solution operator S) is log-concave. 
Theorem |2] proves crucial properties, from the computational point of view, of the optimal violation 
function Vo{r). In particular, this result shows that Vo{r) is non-increasing, and for fixed r > 0, it 
can be obtained as the maximization of a specially constructed unimodal function. Hence, it may be 
easily computed by means of various optimization techniques which are discussed in the next section. 

In Section |Vl] we introduce specific algorithms for computing the optimal violation function. First, 
we observe that the exact computation of Vo{r) requires the evaluation of the volume of polytopes. 
Since this problem is NP-Hard ll30l . we propose to use suitable probabilistic and deterministic 
relaxations. More precisely, first we present a randomized algorithm based upon the classical Markov 
Chain Monte Carlo method B8l . 1471 . which has been studied in the context of randomized stochastic 
approximation methods |[32l : see also ||50j| and @ for further details about randomized algorithms. 
Secondly, we present a deterministic relaxation of Vo{r) which is based upon the solution of a semi- 
definite program (SDP). The performance of both algorithms is shown using the example given in 
Section El 



Section IVIII discusses normally distributed noise, and presents some connections with classical 
stochastic estimation. In particular, it is shown that the least-squares algorithm is optimal also in the 
probabilistic setting discussed in this paper. For this case, we state a bound (which is essentially tight 
for small-variance noise) on the probabilistic radius of information, which is given in Il52l in terms 
of the so-called average radius of information, and it depends on e, on the noise covariance, and on 
the information and solution operators. 
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II. Notation 

In this section, we provide the notation used in this paper. We write || • ||, || • ||2 and || • ||oo to denote 
the Ip, I2 and l^o norms, respectively. The ip norm-ball of center ^c and radius r is defined as 

and we write B{r) = i3(0, r). We denote by 62(^0?^) and by Boo{ic,f) the I2 and £00 norm-balls, 
respectively. We use the notation a; ~ p^i to indicate that the random vector x has probability density 
function (pdf) pa{x) with support set A. The uniform density Ua over the set yl C M" is defined as 

f l/vol[^] ifxGA; 
Ua{x) = < 

I otherwise 

where vol [A\ represents the Lebesgue measure (volume) of the set A, see |[22l for details regarding 

volumes of sets. The uniform density Ua generates a uniform measure ^iu{A) such that, for any 

measurable set B, fJ.u(A){B) = vol [B n A] /vol [A]. The multivariate normal density A/^.s with mean 

value X € M", symmetric positive definite covariance matrix S ;^ is defined as 

The normal probability density function generates the so-called Gaussian measure fj,j^f. The (univariate) 
unilateral Gamma density with parameters a, 6 G M is defined 

G'a,6(x) = --^x'^-V^/^ X>0, (1) 

where r(-) is the Gamma function. We denote by I (•) the indicator function, which is equal to one 
if the clause is true, and it is zero otherwise. The n x n identity and the n x m zero matrices are 
indicated by /„ and On,m, respectively. A set H is said to be centrally symmetric with center x if 
X G H implies that its reflection with respect to x also belongs to H, i.e. (2x — x) ^ H. 

III. Information-Based Complexity 

This section provides a formal overview of the information-based complexity definitions used in 
the paper and two introductory examples illustrating the main concepts in a system identification 
and filtering contexts. An introduction to the IBC framework is given in |[53l and in flU : see the 
monograph ll52l for an advanced treatment of the topic. The relevant spaces, operators and sets 
discussed next are shown in Figure [T] 

Let X be a linear normed n-dimensional space over the real field, which represents the set of 
(unknown) problem elements x G X. Define a linear operator Z, called information operator, which 
maps X into a linear normed m-dimensional space Y 

T:X^Y. 
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Sx-A{y) 




Fig. 1. Illustration of the information-based complexity framework. 

In general, exact information about the problem element x € X is not available and only perturbed 
information, or data, y £ Y is given. That is, we have 



y = Zx + 1 



(2) 



where q represents uncertainty which may be deterministic or random. In either case, we assume that 
(7 € Q, where Q C M™ is a bounding set. Due to the presence of uncertainty q, the problem element 
X £ X may not be easily recovered knowing data y ^Y . Then, we introduce a linear operator S, 
called a solution operator, which maps X into Z 

S:X ^ Z 

where Z is a linear normed s-dimensional space over the real field, where s < n. Given S, our aim is 
to estimate an element Sx G Z knowing the corrupted information y GY about the problem element 

xex. 

An algorithm A is a mapping (in general nonlinear) from Y into Z, i.e. 

A:Y ^ Z. 

An algorithm provides an approximation A{y) of Sx using the available information y gY ot x G X. 
The outcome of such an algorithm is called an estimator z = A{y). 
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We now introduce a set which plays a key role in the subsequent definitions of radius of information 
and optimal algorithm. Given data y E y, we define the consistency set as follows 

^~^{y) = {x G X \ there exists q G Q : y = Ix + q} (3) 

which represents the set of all problem elements x G X compatible with (i.e. not invalidated by) Ix, 
uncertainty q and bounding set Q. Note that, under the sufficient information assumption stated next, 
the set X^^(y) is guaranteed to be bounded. For the sake of simplicity, we assume that the three sets 
X, Y, Z are equipped by the same ip norm, and that the set Q is an ip norm-ball of radius p, that is 
Q = B{p). Note that in this case the set X^^(y) can be written as 

X-\y) = {xGX\\\Xx-y\\<p}. (4) 

The following assumption regarding the operators T and S is now introduced. 

Assumption 1 (Sufficient information and feasibility): We assume that the information operator X 
is a one-to-one mapping, i.e. m > n and rank X = n. Similarly, n > s and S is full row rank. 
Moreover, we assume that the set X^^{y) has non-empty interior. 

Note that, in a system identification context, the assumption on X and on the consistency set are 
necessary conditions for identifiability of the problem element x G X. The assumption of full-rank S 
is equivalent to assuming that the elements of the vector z = Sx are linearly independent (otherwise, 
one could always estimate a linearly independent set and use it to reconstruct the rest of the vector 
z). We now provide an illustrative example showing the role of these operators and spaces in the 
context of system identification. 

Example 1 (System parameter identification and prediction): Consider a parameter identification 
problem which has the objective to identify a linear system from noisy measurements. In this case, 
the problem elements are represented by the trajectory ^ = ^(t, x) of a dynamic system, parametrized 
by some unknown parameter vector x G X. This may for instance be represented as follows 

n 

^{t,x) =^Xii;i{t) = ^'^{t)x, 

with given basis functions ipi{t), and ^^(t) = [il^i{t) ••• V'n(i)]- ^^ suppose that m noisy 
measurements of ^(t, x) are available for ti < t2 < • • • < tm, that is 

y=Ix + q=[^{ti) ••• ^{tm)Vx + q. (5) 

In this context, one usually assumes unknown but bounded errors, such that \qi\ < p, i = I, . . . ,m, 
that is Q = Boo{p)- Then, the aim is to obtain a parameter estimate using the data y. Hence, the 
solution operator is given by the identity, 

Sx = X 
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and Z = X. The consistency set is sometimes referred to as feasible parameters set, and is given as 
follows 

I~\y) = [xeX:\\y-[^i'{h) ■■■ *(t™)]^x||oo < p} • (6) 

In the case of time series prediction, we are interested on predicting s future values of the function 
^{t, x) based on ?n past measurements, and the solution operator takes the form 

Z = Sx = ^{tm+l,x) = '^^{tm+l)x 



for a one-step prediction, and by 

Z = Sx = {^{tm+l,x), 



,^(tm+s,x)} = [^{tm+l) 



^{t^+s)]~X, 



for a s steps prediction. 

Example 2 (Filtering of dynamic systems): Note that the IBC theoretical setting outlined so-far 
also applies to the problem of filtering. Consider an LTI system described by the following state- 
space model: 

Ck+i = Mk + BkWk, Wk e W; 
Vk = Ckik + qk, Qk & Q (7) 

Zk = Hk^k 
where Wk and q^ denote process and measurement noise characterized by a set-membership description 
given by the compact, convex sets W and Q. The goal is to estimate Zn, the value of the variable 
z at a given time instant ?i from the noisy measurements yk, k = 0, . . . ,n. We will further assume 
that the pairs (A, B) and {A, C) are controllable and observable, respectively, and that rank(i/) = s, 
that is, the elements of z are linearly independent This problem fits the formalism described above 
by defining 



X 



' io ' 
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Wo 


, y = 
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Wn-l 




_yn_ 
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CAB 
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(j^n cA^'-^B CA'^-^B ... CAB CB 



(8) 



S = 



jj^n H^n-iQ HA^'^'^B ... HAB HB 
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Note that in the formulation above, observability of the pair {A, C) guarantees that Z has rank n, 
while the controllability of {A,B), coupled with xwk{H) = s guarantees that S has full row rank. 

The case where W and Q are l^o balls in the respective spaces was addressed in ll35l and ll49l . 
The latter showed that a worst-case globally optimal estimator is linear in the measurements and 
can be found based on solving a single LP problem. However, this estimator is non-recursive and its 
complexity cannot be bounded a priori. Related results were also discussed in |[55l . B3l . 

Next, we define approximation errors and optimal algorithms when q is deterministic or random. 
First, we briefly summarize the deterministic case which has been deeply analyzed in the literature, see 
e.g. |[35l . The definitions concerning the probabilistic case are new in this context, and are introduced 
in Section |IVl 

A. Worst-Case Setting 

Given data y e y, we define the worst-case error r'"'^{A^y) of the algorithm A as 

r^c(^,y)= max |15x-^(y)||. (9) 

xgl-i(y) 

This error is based on the available information y ^ Y about the problem element x ^ X and it 
measures the approximation error between Sx and A{y). An algorithm A^'^ is called worst-case 
optimal if it minimizes r"^(^, y) for any y ^Y . That is, given data y € y, we have 

rT{y) = r"^(X^y) = mf r-™^(Ay). (lO) 

A 
The minimal error r^^{y) is called the worst-case radius of information^. 

This optimality criterion is meaningful in estimation problems as it ensures the smallest approxi- 
mation error between the actual (unknown) solution Sx and its estimate A{y) for the worst element 
X G I~^{y) for any given data y ^ Y. Obviously, a worst-case optimal estimator is given by 
z^^ = ^«c(y), see Figure [I] 

We notice that optimal algorithms map data y into the ip-Chebychev center of the set SI~^{y), 
where the Chebychev center Zc{H) of a set if C Z is defined as 

max \\h — Zc(H)\\ = inf max 11^ — z\\ = rJH). 

heH z£Z heH 

Optimal algorithms are often called central algorithms and Zc{SI~^{y)) = z^"^. We remark that, in 
general, the Chebychev center of a set if C Z may not be unique and not necessarily belongs to H, 
for example, when H is not convex or it is a discrete set. 

'in the IBC context, this error is usually referred to as "local" radius of information, to distinguish it from the so-called 
"global" radius, see 1521 for further details. 



February 25, 2013 DRAFT 



10 

Remark 1 (Geometric interpretation of the Chebychev center): Note that, by construction, for any 
given set H (not necessarily convex, nor connected), the ^p-Chebychev center Zc{H) of H and its 
radius rc{H) are given by the center and radius of the smallest £p norm-ball enclosing the set H, see 
Figure |2] 



7 rc{H) 



Fig. 2. Chebychev center and radius of a generic set H. 

That is, we can compute Zc{H) and rc{H) solving the optimization problem 

minr subject to B{z,r) ^ H. (11) 

Note that, as remarked above, the optimal ball B{z, r) need not be unique. It follows immediately 
that if the set H is centrally symmetric with center z, then z is a Chebychev center of H. o 

The computation of the worst-case radius of information r^^{y) and of the derivation of optimal 
algorithms ^"'^ has been the focal point of several papers in a system identification setting. For 
example, for l^o norms in equation ^, it has been shown in Il35l that an optimal algorithm and the 
radius of information can be computed solving 2n linear programs. In particular, an optimal algorithm 
is easily obtained computing the center of the tightest hyperrectangle which contains the polytope 
SI^^{y). If the norm is £2, then SZ~^{y) is an ellipsoid, and the linear optimal estimator is the least 
squares algorithm 

As{y) = Sxis (12) 



where x\s is defined as 



W^xis - 2/II2 = min \\Ix - y\\2. 

xGX 



In this case, see e.g. |[27l . Ais{y) is the Chebychev center of the ellipsoid SI ^{y), 

zr = AUy)=S{I^I)-'l'^y 
and its worst-case eiTor is given by 

rTiy) = ^"'(A, y) = p^Xm.. [s^ii^iy's) 

where p is defined in (01). In general, A\s is not optimal for other bounding sets such as i^o balls. 
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IV. Probabilistic Setting with Random Uncertainty 

In this section, we introduce a probabilistic counterpart of the worst-case setting previously defined, 
that is we define optimal algorithms ^o"^ and the probabilistic radius rP^{A,y,e) for the so-called 
probabilistic setting when the uncertainty q is random, and e G (0, 1) is a given parameter called 
accuracy. Roughly speaking, in this setting the error of an algorithm is measured in a worst-case 
sense, but we "discard" a set of measure at most e from the consistency set SZ~^{y). Hence, the 
probabilistic radius of information may be interpreted as the smallest radius of a ball discarding a 
set whose measure is at most e, see Figure [S] Therefore, we are decreasing the worst-case radius of 
information at the expense of a probabilistic "risk" e. In a system identification context, reducing the 
radius of information is clearly a highly desirable property. Using this probabilistic notion, we can 
compute a trade-off function which shows how the radius of information decreases as a function of 
the parameter e, as described in the tutorial Example |3] and in the numerical example presented in 
Section IVlIIl 




Fig. 3. Illustration of the probabilistic framework for a generic set H. The measure of the dark (red) set Xt - discarded 
from the consistency set - is at most e, and the probabilistic radius is smaller than the worst-case radius. 

We now state a formal assumption regarding the random uncertainty q. 

Assumption 2 (Random measurement uncertainty): We assume that the uncertainty g is a real ran- 
dorn3 vector with given probability density PQ{q) and support set Q = B{p). We denote by ^iq the 
probability measure generated by PQ{q) over the set Q. 

Remark 2 (Induced measure over I~^{y)): We note that the probability measure over the set Q 
induces, by means of equation ^, a probability measure flx-i over the set Z~^{y). Indeed, for 
any measurable set i3 C X, we consider the probability measure ^q as follows: flx-^{B) = ^q(9 G 
Q\3x £ Bnl^^ (y) I Ix+q = y). This probability measure is such that points outside the consistency 
set Z~^{y) have measure zero, and /ij-i (Z~^(y)) = 1, that is this induced measure is concentrated 
over Z~^(y). This induced measure is formally defined in ||52] Chapter 6], where it is shown that it 

^For simplicity, in this assumption we consider the case when the density of q exists (that is the distribution is 
differentiable). 
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is indeed a conditional measure. Similarly, we denote by pj-i the induced probability density, having 
support over Z~^{y). We remark that Theorem[T]in Section fVlstudies the induced measure /ix-i(-) over 
the set Z~^{y) when q is uniformly distributed within Q, showing that this measure is still uniform. In 
turn, the induced measure /iji is mapped through the linear operator S into a measure over ST~^{y), 
which we denote as /isz:-i- Similarly, the induced density is denoted as psr-i. In Theorem [T] in Section 
rvl we show that the induced measure /i^r-i is log-concave in the case of uniform density over Q. o 

Given corrupted information y ^Y and accuracy e G (0, 1), we define the probabilistic error (to 
level e) r^'^{A, y, e) of the algorithm A as 

rP'{A,y,e)= inf max \\Sx - A{y)\\ (13) 

where the notation T~^{y) \ X^ indicates the set-theoretic difference between X~^{y) and X^^, 

X'\y)\X, = {x eX-\y)\x ^ X,} . 

Clearly, r^'^{A,y,€) < r'"'^{A,y) for any algorithm A, data y ^Y and accuracy level e G (0,1), 
which implies a reduction of the approximation error in a probabilistic setting. 

An algorithm ^o'^ is called probabilistic optimal (to level e) if it minimizes the error rP'^(>l, y,e) 
for any y ^Y and e G (0, 1). That is, given data y GY and accuracy level e G (0, 1), we have 

C(y,6) = rP■■(^P^y,e) = inf r^' {A, y , e) . (14) 

The minimal error ro^{y,e) is called the probabilistic radius of information (to level e) and the 
corresponding optimal estimator is given by 

z^J{e) = A^''{y,e). (15) 

The problem we study in the next section is the computation of ro^{y,e) and the derivation of 
probabilistic optimal algorithms AS' ■ To this end, as in [|52|| . we reformulate equation (IT3T l in terms 
of a chance-constrained optimization problem ll37l 

r^^{A, y, e) = min {r \ v{r, -4) < e} , 

where the violation function for given algorithm A and radius r is defined as 

v{r,A) = fix-^{x (^l-^{y)\\\Sx - A{y)\\ > r) . 

Then, this formulation leads immediately to 

rP''(y,e) = min{r |i;o(r) < e}, (16) 

where the optimal violation function for a given radius r is given by 

Voir) = inf /ix-i{x G X'\y) : ||5x - ^(y)|| > r] . (17) 
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To illustrate the notions introduced so far, we consider the following numerical example. The 
example is tutorial, and it is sufficiently simple so that all relevant sets are two dimensional and can 
be easily depicted. 

Example 3 (Identification of a second order model): We consider a particular case of the system 
parameter identification problem introduced in Example [T] Our aim is to estimate the parameters of 
a second order model 

Vk = xiUk + X2Uk-i + Qk, k = l,...,m (18) 

where the input Uk is a known input sequence. The (unknown) nominal parameters were set to 
[1.25 2.35]\ and m = 100 measurements were collected generating the input sequence {uk} 
according to a Gaussian distribution with zero mean value and unit variance, and the measurement 
uncertainty g as a sequence of uniformly distributed noise with \qk\ < 0.5. Note that, in this simple 
case, the operator S is the identity, and thus X = Y and the sets X~^{y) and SI~^{y) coincide. That 
is, the solution coincides with the parameters, and the goal is to estimate Zi = Xi, i = 1,2. 

First, the optimal worst-case radius defined in (ITOl i and the corresponding optimal solution have 
been computed by solving four linear programs (corresponding to finding the tightest box containing 
the polytope SZ~^{y)). The computed worst-case optimal estimate is z^'^ = [1.2499 2.3551]^ and 
the worst-case radius is r^'^{y) = 0.0352. For comparison, we also computed the classical least- 
squares estimate defined in (IT2] | obtaining Ais{y) = [1.2873 2.3190]^. Note that the least-square 
estimate is not worst-case optimal as expected, and it is not even interpolatory, since it is outside of 
the consistency set 5Z~^(y) (see |[52l for a formal definition of interpolatory algorithm). 

Subsequently, we fix the accuracy level e = 0.1, and aim at computing a probabilistic optimal radius 
and the corresponding optimal estimate according to definitions (fT4l) and (fTSl ). By using the techniques 
discussed in Section |Vl we obtained r^'{y, 0.1) = 0.0284 and ^^(0.1) = [1.2480 2.3540]"^, which 
represents a 25% improvement. 

V. Random Uncertainty Uniformly Distributed 

In this section, which contains the main technical results of the paper, we study the case when q 
is uniformly distributed over the ball Q = B{p), i.e. q ~ Uq and fiQ = fJ-uig). First, we address a 
preliminary technical question: If hq is the uniform measure over Q, what is the induced measure 
fLx-i over the set Z~^{y) defined in equation ((21).'' The next result shows that this distribution is indeed 
still uniform. Furthermore, we prove that the induced measure on ST~^{y) is log-concave. 

Remark 3 (Log-concave measures and Brunn-Minkowski inequality): We recall that a measure ;u(-) 
is log-concave if, for any compact subsets A, B and A € [0, 1], it holds 

^i{\A + (1 - \)B) > ii{Af^i{B)^-^ 
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Fig. 4. Consistency set and relevant quantities for Example [3] The worst-case optimal radius r"'^(j/) con'esponds to the 
radius of the dash-dotted box enclosing the polytope ST~^(y). Its center, denoted by a cross +, represents the optimal 
worst-case estimate z^^ . The probabilistic optimal (to level e = 0.1) radius ro'^(j/,e) corresponds to the radius of the 
solid-line box, which is the "optimal set" X^ that, according to definition ( I13t . discards a set of measure e from X^^{y). 
The discarded set I~^{y) \ X^ is represented by the dark (red) area. The center of this box, denoted by a star *, represents 
the optimal probabilistic estimate 2;P''(e). 



where A^ + (1 — ^)B denotes the Minkowski sun£| of the two sets XA and (1 — X)B. Note that the 
Brunn-Minkowski inequaUty B2l asserts that the uniform measure over convex sets is log-concave. 
Furthermore, any Gaussian measure is log-concave. o 



Theorem 1 (Measures over I ^(y) and SI ^{y))-' Let q ~ U{Q) with Q = B{p), then, for any 
y G y it holds: 
(i) The induced measure /xj-i is uniform over X~^{y), that is /ii-i = fJ.ug:-i[y))', 
(ii) The induced measure flsi-^ over SX^^{y) is log-concave. Moreover, if 5 G M"'", then this 
measure is uniform, that is /isz:-i = fJ-u{SX-^{y))- 
Proof: Consider the transformation matrix T = [Ti T2], where Ti is an orthonormal basis of the 
column space of I and T2 is an orthonormal basis of the null space of I~^. Furthermore, define the 
linear transformation q = T~^q. Then, if we define the set Q = {g G M™ | T"^ q G Q} we have that, 
if the random variable q is uniform on Q, then the linearly transformed random variable q is uniform 
on Q (see e.g. Ell). Then, by multiplying equation ([2]) from the left by T~l and defining Xi = T^X 

^ The Minkowski sum of two sets A and B is obtained adding every element of A to every element of B, i.e. A + B — 

{a + b\a e A,b e B}. 
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and yt = T^y, qi = Tjq, i = 1, 2, we get 

Xix + q, =m ^^^^ 

q2 = m, 

since, by construction, T2T = 0. It follows from definition (O that a point x & X belongs to 
X^^{y) if and only if there exists q = [?^] € Q such that (|T9l ) holds, i.e. if there exist qi in the 
set Qi = {^1 G M" I [|^] G Q} for which Zix + qi = yi. Note that the set Qi represents the 
intersection of the set Q with the hyperplane {q = [gl] G M™ \q2 = y2}- Since q is uniform on Q, it 
is also uniform on any subset of Q, and in particular on this intersection set. Hence, qi is uniformly 
distributed on Qi. 

Statement (i) is proved noting that, from (|T9] l. an element x G Z~^{y) C M" can be written as 
the mapping of ^1 G Q C M" through the one-to-one affine transformation x = Z^'^{qi — yi). Since 
bijective linear transformations preserve uniformity [l39l . it follows that the random variable x is 
uniformly distributed on Z~^{y). 

Point (ii) follows immediately from the fact that the image of a uniform density through a linear 
operator M" -^ W with s < ?i is log-concave (see e.g. |[39l . ll50l ). D 



The result in this theorem can be immediately extended to the more general case when Q is a 
compact set. We now introduce an assumption regarding the solution operator S. 

Assumption 3 (Regularized solution operator): In the sequel, we assume that the solution operator 
is regularized, so that 5 = [5 Os^„_s] , with S G W'^. 

Remark 4 (On Assumption [J]).- Note that the assumption is made without loss of generality. Indeed, 
for any full row rank S G M**'", we introduce the change of variables T = [Ti T2], where Ti is an 
orthonormal basis of the column space of 5^ and T2 is an orthonormal basis of the null space of 
S (in Matlab notation, we write Ti = orth(5^) and T2 = null(5) ). Then, T is orthogonal by 
definition, and it follows 

z = Sx = STT\ = S [Ti T2] T~^x = [STi ST2] T\ = [S Os,n-s] x = Sx, 

where we introduced the new problem element x = T^x and the new solution operator S = ST. 
Note that, with this change of variables, equation (|2]i is rewritten as 

y =Zx + q, 

by introducing the transformed information operator Z = ZT. We observe that any algorithm A, 
being a mapping from Y to Z, is invariant to this change of variable. It is immediate to conclude 
that the new problem defined in the variable x and operators Z and S satisfies Assumption |3] o 
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Instrumental to the next developments, we first introduce the degenerate cone (cylinder) in the 
element space X, with given "center" Zc ^ Z and radius r, as follows 

C{zc,r) = {x G M" I \\Sx -Zc\\<r}c X. (20) 

Note that this set is the inverse image through S of the norm-ball B{zc,r) C Z. Moreover, due to 
Assumption [3j the cylinder C{zc,r) is parallel to the coordinate axes, that is any element x of the 
cylinder can be written as 

Hence, for the case s < n, the cylinder is unbounded, while for s = n it is simply a linear 
transformation through S~^ of an £p norm-ball. Next, for given center Zc € Z and radius r > 0, we 
define the intersection set between the cylinder C{zc,r) and the consistency set X~^{y) 

^{zc,r)=I-\y)nCiz„r)cX (21) 

and its volume 

0(ze,r) = vol[$(zc,r)]. (22) 

Finally, we define the set T-L{r) of all centers Zc G M* for which the intersection set ^{zc,r) is 
non-empty, i.e. 

n{r) = {zc£R'\^izc,r)^^}. (23) 

Note that, even if the cylinder C(zc, r) is in general unbounded, the set ^{zc, r) is bounded whenever 
Zc G 'H(r), since X~^{y) is bounded. 

We are now ready to state the main theorem of this section, that provides useful properties from 
the computational point of view of the optimal violation function defined in (fTTl ). 

Theorem!: Let q ~ U{Q) with Q = B{p), and S = [S {)s,n-s], with S G M^'*. Then, the 
following statements hold 

(i) For given r > 0, the optimal violation function Vo{r) is given by 

where (J)o{t) is the solution of the optimization problem 

4>o{r) = max 4>{zc,r) (25) 

with (/)(2c,r) and ?^(r) defined in (l22l) and (l23l) . respectively, 
(ii) For given r > 0, the function (l22l) is continuous semi-strictly quasi-cone avcl in Zc G 'H{r); 

''a function / defined on a convex set A £ R" is semi-strictly quasi-concave if /(y) < f{\x + (1 — A)j/) holds for any 
x,y £ A such that /(a;) > f{y) and A G (0, 1). 
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(iii) The function Vo{r) is right-continuous and non-increasing for r > 0. 

Proof: Recalling that fiQ is the uniform measure over Q, from dTTT i we write 

vol [{x e 2:-i(y) I \\Sx - A{y)\\ > r}] 



Vo{r) = inf 



vol [X- 
1 


'(y)] 


vol [I~ 

1 


Hy)] 



A voi[z-i(y)] 

inf vol [|x G X~ (y) | ||52; — Zc\\ > r}] 
inf vol [|x € I^^{y) \ x ^ C(zc, r)}] 

- voiix-ny)]'-^"^"^^^"'^^^^^^""'^^- 

Next, we note that this equation can be rewritten as the following maximization problem 

^o(r) = 1 - ^^^J-Uy^l sup vol [l'\y) D C{z,,r))] (26) 

= 1 1 r-r-u M sup4>{zc,r)). (27) 

vollT 1(2/)] z. 

The statement in (i) follows immediately considering that this optimization problem can be restricted 
to the set ?^(r) where the intersection is non-empty. The existence of a global maximum is guaranteed 
because ?^(r) is convex (and thus compact) and the function cl){zc,r)) is continuous in Zc- 



To prove point (ii), note that problem (I25l l corresponds to maximizing the volume of the intersection 
^{zc,r) between the two convex sets Z~^{y) and C{zc,r). One of them, I"^{y), is fixed, while the 

\ S~^Zc 1 
other one is the set obtained translating the cylinder C(0, r) by . Similar problems have 

L J 
been studied in convex analysis, see for instance lISTI . In particular, the proof of continuity follows 

closely the proof of Lemma 4.1 in |(2p]|. That is, consider an arbitrary direction ^ € M^, and let Vc 

be the volume of the set obtained projecting C(0, r) to the hyperplane normal to ^. Then, for any 

e > 0, we have that the difference between the volume of ^{zc,r)) and (f){zc + e.^, r)) is bounded by 

eVc||C||. Hence, cl){zc,r)) — cj){zc + ^C^r)) converges to zero for e — )■ 0, thus proving continuity. 

To prove semi-strict quasi-concavity, consider two points zi, Z2 G ^(r) such that (j){zi,r) > cl){z2,r). 

Consider then a point zx = Xzi + {1 — \)z2 where A G (0, 1). From convexity of ^(r) it follows 

that Zx G ■H(r). Then, the following chain of inequalities hold 

(/>(zA,r)^/" = 0(Azi + (1 - A)z2)i/" (28) 

> vol[A$(zi,r) + (l-A)$(zi,r)]^/" (29) 

> Avol[$(zi,r)]^/" + (l-A)vol[$(zi,r)]^/" (30) 
= A(/>(zi,r)i/- + (l-A)(/.(zi,r)V- (31) 

> A(/)(z2,r)i/'' + (l-A)(/>(zi,r)i/" (32) 
= <^(^2,r)i/'^ (33) 
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where (|29l ) follows from ||57] Theorem 1], (|30] | follows from the Brunn-Minkowski inequality for 
convex analysis ll42l and (l30l ) follows from the hypothesis that cl){zi,r) > (f){z2,r). From this chain 
of inequalities, we have (j){zx,r) > cl){z2,r), which implies semi-strict quasi-concavity. 

To prove point (iii), we note that Vo{r) is right continuous and non-increasing if and only if 
4>o{f) is upper semi-continuous and non-decreasing. To show upper semi-continuity of the supre- 
mum value function (j^oif), consider the radius r = r'^'^{y), which is nonzero since Z^^{y) is 
assumed non-empty. Then, from point (ii) it follows that, for any z € ?^(r), the upper level set 
F{z) = {z £ T-L{f) I 4>{z,f) > (/)(z,7~)} is strictly convex. Hence, the function </>(•, r) is quasi-convex, 
continuous and satisfies the boundedness condition defined in lISTI . Then, upper semi-continuity of 
4>o{r) follows from direct application of lISTl Theorem 2.1]. Finally, to show that (pair) is non- 
decreasing, take < ri < r2 and denote zi and Z2 be the optimal solutions corresponding to 0o(^i) 
and 4'o{f2), respectively. It follows that 

4'{zoi,r2) <(p{zo2,r2) = (poir2), (34) 



since 2:02 is the point where the maximum is attained. On the other hand, from definition (I2TI ) and 
ri < r2 we have 

$(zoi,ri) = (X-i(y) n C{zoi,n)) C {l~\y) H C{zoi,r2)) 

and hence 

(poiri) = 4>izoi,ri) <(p{zoi,r2). (35) 

Combining (l34l ) and (1351 ) it follows (/>o(^i) < <Ao(''2)- D 

Remark 5 (Unimodality of the function (j){zc,r)): Point (ii) in Theorem|2]is crucial from the com- 
putational viewpoint. Indeed, as remarked for instance in |fT9l , a semi-strictly quasi-concave function 
cannot have local maxima. Roughly speaking, this means that the function 4>{-,r) is unimodal, and 
therefore any local maximal solution of problem (l25T l is also a global maximum. Note that from the 
Brunn-Minkowski inequality it follows that, if there are multiple points z^{i) where (!){•) achieves its 
global maximum, then the sets (^{zo , r) are all homothetic, see ll42l . Further, from the definition of 
$(-,r), this implies that <I>(4'\r) = (t{zo\r) + Zq - Zo ■ 

This fact is illustrated in Figure |5] where we plot the function (j){zc,r) for the tutorial problem 
considered in Example [3l for two different values of r. In the figure on the left, the two sets Z~^(y) 
and C{zc, r) always intersect, the function is unimodal, and clearly presents a unique global maximum. 
In the figure on the right, the radius r is smaller, and there are values of Zc for which C{zc,r) is 
completely contained in Z~^{y), thus leading to the "flat" region on the top. However, note that this 
is the only flat region, so that the function is "well-behaved" from an optimization viewpoint. o 
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Fig. 5. Function (f>{zc,r) for tiie tutorial problem considered in Example |3] for r = 0.0284 (a) and r = 0.0150 (b). 

Remark 6 (Probabilistic radius and probabilistic optimal estimator): Theorem |2] provides a way 
of computing the optimal probabilistic radius of information ro'^{y,e). Indeed, for given r > 0, the 
probabiUstic radius of information (to level e) is given by the solution of the following one-dimensional 
"inversion" problem 

rP''(y, e) = min {r | Vo{r) < e} . (36) 

Note that point (iii) in Theorem |2] guarantees that such solution always exists for e G (0, 1), and it 
is unique. The corresponding optimal estimate is then given by 

.P'-(e) = ^P^(y,6) = Zo(rP^(y,e)), 

where we denoted by Zo{r) a solution of the optimization problem (1251 ). 

To illustrate, continuing with the tutorial Example |3] we plot in Figure |6t a) the function Vo{r) for 
r € (0, r^'^]. We see that Vo{r) is indeed non-increasing (actually, it is strictly decreasing), and hence 
the inverse problem (|36] | has clearly a unique solution for any e € (0, 1). o 

Theorem |2] shows that the problem we are considering is indeed a well-posed one, since it has a 
unique solution (even though not a unique minimizer in general). However, its solution requires the 
computation of the volume of the intersection set ^{zc,r), which is in general a very hard task. A 
notable exception in which the probabilistic optimal estimate is immediately computed for q uniformly 
distributed in Q is the special case when the consistency set X~^(y) is centrally symmetric with center 
X. Indeed, in this case it can be seen that ST~^{y) is also a centrally symmetric around z = Sx, 
and so is the density jlsir^. Hence, the optimal probabilistic estimate coincides with the center z. 
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Fig. 6. (a) Plot of the function Vo{r) for Example|3] and computation of the optimal probabilistic radius for e = 0.1. (b) 
Plot of the "optimal" loa balls for different values of r. The crosses denote the corresponding optimal estimates r'^^{y, e). 



since it follows from symmetry that the probability measure of the intersection of SI~^{y) with an 
ip norm-ball is maximized when the two sets are concentric. Moreover, this estimate coincides with 
the classical worst-case (central) estimate, which in turn coincides with the classical least squares 
estimates. 

Remark 7 (Weighted I2 norms): Note that the requirement of Z^^{y) being centrally symmetric is 
quite demanding in general, but holds naturally when the random uncertainty q is uniformly distributed 
in a (weighted) £2 ball of radius p, that is 



Q = {q-h\\w<p], m\w = ^/Cwi, vr^o, 

and iiq{Q) = Uq{Q). This framework has been also considered in the classical set-membership 
literature, see for instance [271, and it is well-known that in this case the set X~^(y) is the ellipsoid 



^~\y) 



G X I (x - xisV X^W^Z (x-xis)<p , 



(37) 



centered around the (weighted) least-squares optimal parameter estimate 

-1 



X\s 



l^Wl] I^Wy. 



Hence, it follows from symmetry that, for any e S (0, 1), the probabilistic optimal estimator to level 
e is given by 

However, we are not aware of any easy formula to compute the corresponding probabilistic optimal 
radius ro'^(y,e), see Section IVlIl for further comments. o 
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VI. Randomized and deterministic algorithms for violation function 

APPROXIMATION 

In this section, we concentrate on the solution of the optimization problem defined in (|25] ). Theorem 
[2] for fixed r > 0. For simplicity, we restate this problem dropping the subscript from Zc 

(P-max-int) : max (t){z,r), (t>{z,r) =Yo\\Z~^{y) n C{z,r)\ . (38) 

z£H{r) 

First, note that this problem is very hard in general. For instance, for £i or ioo norms, the consistency 
set Z~^{y) is a polytope and C{z, r) is a cylinder parallel to the coordinate axes whose cross-section 
is a polytope. Hence, even evaluating the function ^(z, r) appearing in (1381 ) amounts at computing 
the volume of a polytope, and this problem has been shown to be NP-hard in |[30l . 

Remark 8 (Volume oracle and oracle-polynomial-time algorithm): For the case of polytopic sets, 
the papers |[T], ll20l study the problem (P-max-int) in the hypothetical setting that an oracle exists 
which satisfies the following property: given r > and z € ?^(r), it returns the value of the function 
(j){z, r), together with a sub-gradient of it. In this case, in ([Tl a strongly polynomial-time (in the number 
of oracle calls) algorithm is derived. Note that, even if the problem is NP hard in general, one can 
compute the volume of a polytope in a reasonable time for considerably complex polytopes in modest 
(e.g. for n < 10) dimensions, see 161 . In this particular case, for l^o norms, the method proposed by 
II20I may be used. For instance, for the tutorial Example |3] all relevant quantities have been computed 
exactly by employing this method. However it should be remarked that, for larger dimensions, the 
curse of dimensionality makes the problem computationally intractable, and alternative methods need 
to be devised. o 

In the next subsections, we develop random and deterministic relaxations of problem (P-max-int) 
which do not suffer from these computational drawbacks. 

A. Randomized algorithms for computing (P-max-int) 

In this section, we propose randomized algorithms based on a probabilistic volume oracle and 
a stochastic optimization approach for approximately solving problem (P-max-int) for generic £p 
norms. First of all, we compute a bounded version of the cylinder C{z,r). To this end, we note that 
bounds Xj~, xf on the variables Xi, i = s + 1, . . . ,n, can be computed as the solution of the following 
2{n — s) convex programs, 

x~ = min Xi xf = max Xj 

, , i = s + 1, . . . ,n. (39) 

subject to x€X~^(y) subject to xGl~^{y) 

The problems above are convex, and for generic £p norms can be solved by any gradient-based 
method. In particular, problem (|39] l reduces to the solution to 2(n — s) hnear programs in the case 
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of ii or £00 nomis. Then, under Assumption |3] we define the cyUnder 



C{z,r) 



X G 



< r, X- < Xi < xj ,i = s + 1, . . . , n > . 



Note that the cyUnder C{z,r) is bounded, and has volume equal to 

{2rYT'{l/p+l) 



vol [C(z,r)] 



|det(5)|r(s/p + 



(y n K+-^r) = vc. 



(40) 



(41) 



i=s+l 



By construction, we have that, for any r > and z € ?^(r), <l>(2;,r) = Z^^{y) n C(z,r). Note that 
independent and identically distributed (iid) random samples inside C{z, r) can be easily obtained from 
iid uniform samples in the £p-norm ball, whose generation is explained in [8]. Then, a probabilistic 
approximation of the volume of the intersection <l>(z, r) may be computed by means of the randomized 
oracle presented in Algorithm [T] which is based on the uniform generation of iid samples in C{z, r). 

Algorithm 1 Probabilistic Volume Oracle 
1. Random Generation 

Generate N iid uniform samples C,^^\ . . . , C,^^^ in the s-dimensional ball B{z, r) 

- For i = 1 to A^ 

- Generate s iid scalars according to the unilateral Gamma density 7^ ~ Gi/p 1 

- Construct the vector 77 € R" of components rjj = Sj^, , where Sj are iid random signs 

- Let C*-*^ = z + r tt;^/" -[pjp where w is uniform in [0, 1] 
End for 

Generate A^ iid uniform samples ^^^\ . . . ,^^'^^ 

- For i = 1 to A^ 

- Generate ^* uniformly in the interval [x~_^_j, xf_^_j], j = 1, . . . ,n — s 
End for 

Construct the random samples in C{z,r) as follows 



X 



(i) 



C 



(i) 



i = l,...,N 



2. Consistency Test 



- Compute the number of samples inside Z ^{y) as follows 



N 



Ng = J2^(\\lx'^'^-y\\<P 



i=l 



3. Probabilistic Oracle Return an approximation of the volume (j){z,r) as follows 



^^[z,r) = ^Yc 



where Vc is defined in (|4T]) 
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Note that the expected value of the random variable (J)n{z, r) with respect to the samples x^^\ ■ ■ ■ i X^^^ ^ 
C{z,r) is exactly the volume function 4>{z,r) appearing in (P-max-int) that is 



E 



0Af(z,r) =(j){z,r). 



This immediately follows from the linearity of the expectation 



E 



(^Af(z,r) 



E 



1 



N 



^J:I M)€X-(y) Vc 



i=l 



1 ^ 

i=l 



Vc- 



Then, we have 



E 



I f X^*) G Z-Hy) 



= 1 • Prob [x^) € l~\y)] + • Prob {x« X-i(y)} 
= vol [$(z, r)] /vol [C(z, r)] = (^(z, r)/Vc. 
Hence, we reformulate the problem at (P-max-int) as the following stochastic optimization problem 



max E 

zGH{r) 



(t>N{z,r) 



This problem is classical and different stochastic approximation algorithms have been proposed, see for 
instance |[32l . B4l and references therein. In particular, in this paper, we use the SPSA (simultaneous 
perturbations stochastic approximation) algorithm, first proposed in ll46l . and further discussed in 
ll48l . Convergence results under different conditions are detailed in the literature, see in particular the 
paper l!24l which applies to non-differentiable functions. This approach is outlined in Algorithm |2] 



Algorithm 2 SPSA approach to (P-max-int) 



1. Initialization Let fc = 0, choose initial feasible center zq € ?^(r) and stepsize sequences 
{«fe}, {cfc} satisfying a^ -^ 0, X]fc "fc = oo, c^ ^ 0, J2k'^k = oo 

2. Bernoulli generation 



- Generate s iid Bernoulli points A^ € {0, 1}' 

- Define [A^^] 
3. Approximate gradient 



A. 1 • • • ^ki 



- Compute the two perturbed values (pj^ = (f)Nizk ± CfcAfc,r) 

- Compute an approximate (sub)gradient as 



^^(-.,0 = [A,7^]^^^^ 



4. SUBGRADIENT ASCENT 



Zk+1 = Zk + akd(f)Nizk,r) 
5. ITERATION Let A; = A; + 1 and goto 2 
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Remark 9 (Scenario-based algorithms): An alternative approach based on randomized methods 
can be also devised employing results on the scenario optimization method introduced in Q. In 
particular, exploiting the results on discarded constraints, see lOTI . |[T3l . an alternative algorithm can 
be constructed. The idea is as follows: (i) generate N samples x*^*^ in X~^{y) according to the induced 
measure flx-h ii) solve the discarded-constraint random program 

min 7 (42) 



2.7 



t. ij^i(||5x«_^)||>^)<e (43) 



L 

ieiL 

where II is a set of L indices constructed discarding in a prescribed way N — L indices from 
the set 1, 2, . . . , A^. Then, in ifTTl . |[T3l it is shown how to choose N and the discarded set II to 
guarantee, with a prescribed level of confidence, that the result of optimization problem (|42] | is a 
good approximation of the true probabilistic radius ro\y,e). However, this approach entails many 
technical difficulties, such as the random sample generation in point (i) and the optimal discarding 
procedure in point (ii), whose detailed analysis goes beyond the scope of this paper, and it is studied in 
IITSl . We also point out that a different approach, also based on scenario optimization and discarded 
constraints, has been developed in llT2l for identification and reliability problems, introducing the 
concept of interval predictor models. o 

B. A semi-definite programming relaxation to (P-max-int) 

In this section, we propose a deterministic approach to (P-max-int) based on a semidefinite 
relaxation of the problem. We develop our approach focusing on the ioo norm; extensions to ii 
and £2 norms are briefly discussed in Remark [10] First note that, in the case of ioo norms, Q is 
an hypercube of radius p and therefore X~^{y) is the polytope Vx defined by the following linear 
inequalities 

X-\y) = {x G M" I ||Xx - ylloo < p} (44) 

r X 1 r pi + y 1 1 

x€M"| x< ) =Vx (45) 



X 


X < 


pl + y 


-X 




p^-y 



where 1 is a vector of ones, 1 = [1 1 • • • 1]^. Since the exact computation of the volume of the 
intersection of two polytopic sets is in general costly and prohibitive in high dimensions, as discussed 
in Remark [8j we propose to maximize a suitably chosen lower bound of this volume. This lower 
bound can be computed as the solution of a convex optimization problem. The idea is to construct, 
for fixed r > 0, the maximal volume ellipsoid contained in the intersection <l>(z,r), which requires 
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to solve the optimization problem 



max vol[£{x£,P£)] 

z,xe,P£ 



(46) 



subject to £{xe,Ps) ^ <^[z,r), 
where the ellipsoid of center xg and shape matrix Pg is 

£{x£, Ps) = {x(^W^\x = xe + Pew, \\w\\2 < 1} . 

The problem of deriving the maximum volume ellipsoid inscribed in a polytope is a well-studied one, 
and concave reformulations based on linear matrix inequalities (LMI) are possible, see for instance 
Q, m. For completeness, we report this result in the next theorem. 

Theorem 3: Let q ~ U{Q) with Q = B{p), and S = [S Gs,n-s], with S G W^" . Then, for given 
r > 0, a center that achieves a global optimum for problem (|46] | can be computed as the solution of 
the following semi-definite programming (SDP) problem 



Zq P(r) € arg^ min — logdetP£• 



subject to i-*£ ^ and 

{p + eJ{y-Xxs))In PeT^ei 

* P + eJ{y-Xxs) 
i = 1, . . . ,m 

{p-ej{y - Ix£ ))In - Pel^Si 

* P- ejiv - Ixe) 
i = 1, . . . ,m 

(r + ejiz - Sx£))In PsS^ei 

• r + ej{z — Sxg) 
i = l,...,s 

(r - ej{z - Sxe))In PsS^Ci 

• r — ej{z — Sxg) 
i = l,...,s, 



hO, 



^0, 



^0, 



^0, 



(47) 



(48) 



(49) 



(50) 



where e^ and ej are elements of the canonical basis of M™ and M*, respectively. Moreover, for all 

r > 0, fo ^(r) > Vo{r), where we defined 

(j)(zf'^{r),r) 

° ^^ voi[z-i(y)] 

Proof: The theorem is immediately proved seeing that (l47T i. (l48T l impose that £'(x£:, Ps) C X^^{y) 

while (l49l ). (l50b impose that £'(x£:,P£) C C{z,r). This problem is an SDP since the equations are 

linear matrix inequalities in the variables z,X£,P£, and the cost function is convex in P^. O 
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r . 



\ ! ++ ■ \ ^> 



1.21 1.22 1.23 1.24 1.25 1.2S 1.27 1.28 



Fig. 7. Optimal loo ball for Example |3] for r = 0.0284 (in blue), and optimal ball computed as the solution of the SDP 
relaxation, with corresponding ellipse £{xe,Pe) (in red). 

Remark 10 (SDP relaxations for £i and £2)' An approach identical to that proposed in Theorem 
[3] can be developed for ii norm, considering that also in this case the sets X~^(y) and C{z,r) are a 
polytope and a cylinder with polytopic basis. Similarly, an analogous algorithm can be devised for 
(weighted) £2 norm. In this case, one should maximize the volume of an ellipsoid contained in the 
intersection of Z~^(y) and C{z, r), which are respectively the ellipsoid defined in (|37| | and a cylinder 
with spherical basis. It can be easily seen, see e.g. O, that this latter problem easily rewrites as a 
convex SDP optimization problem. o 



VII. Random Uncertainty Normally Distributed and Connections with 

Least-Squares 

In this section, we concentrate on the case when the uncertainty q is normally distributed with 
mean value q and covariance matrix S >- 0, that is g ~ -^g,s> and the set Q coincides with M™. This 
permits to draw a bridge between the probabilistic setting introduced in this paper and the classical 
theory of statistical estimation, which is usually based on normally distributed additive noise. Indeed, 
it is well known, see e.g. ESl . Il33l that the minimum variance unbiased estimator for the linear 
regression model (|2]i is given by the Gauss-Markov estimator 

which coincides with the (weighted) least-squares estimator discussed in Remark |7] for W = S~^. 
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We first remark that this minimum variance problem falls into the average setting of IBC, see |22|. 
In particular, we recall that the IBC average setting has the objective of minimizing the expected 
value of the estimation error, that is, for given y, the optimal average radius is defined as 

r^(2/) = i^f(E[||5x->l(y)||2])^/^ (51) 

and E [•] denotes the expected value taken with respect to the conditional measure /ij-i introduced 
in Remark |2] (which is also Gaussian, due to well-known properties of normal measures). It follows 
that the optimal average estimate is immediately given by 

Zq — OXls, 

for any y ^ Y . Moreover, in ||52] Chapter 6] it is proven that the optimal average radius does not 
depend on the measurement y, and it can be computed in closed form as 



C = C(y) = VTrace(5XTS-iI5T). 

For what concerns the probabilistic optimal estimate, which is the subject of this paper, we first 
remark that in the case of normally distributed noise, the definition of the probabilistic radius (IT3T l 
still applies, observing that the consistency set Z~^(y) defined in ^ in this case is given by 

X~^{y) = {x e X\ there exists q ^W : y = lx + q} , 

and is unbounded. Hence, the "discarded" set X^ in (fT3] ) can be also unbounded. Note that this is not 
an issue, since /ij-i is defined over all M", so that the measure of unbounded sets is well defined. 

Similarly to the worst-case and the average settings, the optimality properties of the least-square 
solution still hold for the probabilistic setting. Indeed, in [52, Chapter 8] it is proven that the optimal 
probabilistic estimate (to level e) for normal distributions is given by 

for any y E y. Closed-form solutions for the computation of the probabilistic radius ro' (e) are not 
available, and in |[52l Chapter 8] the following upper bound is given 



rl\e,y)<^2\n-^rl\ for all y G F. 

However, it is also observed that this bound is essentially sharp when the noise variance is sufficiently 
small. 
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VIII. Numerical example 

As a numerical example, we consider a randomly generated instance of ([2]). In particular, m = 150 
random measurements of an unknown n = 5 dimensional vector were generated taking 

I = round(20 * rand(m, n) — 10) 
q = p(2 * rand(m, 1) — 1), 

with p = 5. Similarly, the solution operator was randomly chosen as 

-5 10 -7 

3-4700 

2 6 4 

First, the optimal worst-case radius and the corresponding optimal solution have been computed 
by solving ten linear programs (corresponding to finding the tightest box containing the polytope 
SX~^{y), see f35j). The computed worst-case optimal estimate is z^'^ = [—1.831 5.839 11.883]^. 
and the worst-case radius is r'^^{y) = 0.5791. Subsequently, in order to apply our probabilistic 




Fig. 8. Consistency set, and optimal "box" discarding a set of measure e = O.L The probabilistic optimal radius r^^{y, 0.1) 
corresponds to the radius of this box. The center, denoted by a star, represents the optimal probabilistic estimate Za^{y). 

framework, we fixed the accuracy level to e = 0.1, and computed the probabilistic optimal radius 
and the corresponding optimal estimate according to definitions (fT4l) and dTSl ). For this size, we 
were still able to use the techniques discussed in Remark [8] for computing (P-max-int) exactly. By 
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employing a simple bisection search algorithm over Vo{r), the probabilistic radius of information was 
computed as ro'^(y, 0.1) = 0.3074. The corresponding optimal probabilistic estimate is given by and 
Zo^{0.1) = [—1.773 5.869 11.969]^. Note that the reduction in terms of radius of information is quite 
significant, being of the order of 50%. The meaning of our approach is well explained in Figure 
m Indeed, in this figure we see that we look for the optimal "box" discarding a set of probability 
measure e = 0.1. Note that, in this figure, the volume of the "discarded set" is clearly more than 
10% of the total volume. The reason of this is that the probability of the discarded set is measured 
in the (five dimensional) X space. Figure |9] shows a plot of the violation function Vo{r) computed 
using the different techniques discussed in this paper. It can be observed that all methods provide 
very consistent results. 



f„(r) 




Fig. 9. Function Vc,{r) for the numerical example. The blue line is the solution obtained computing the volume using the 
deterministic techniques discussed in Section |V] the red line coixesponds to the SPSA-algorithm and the green one is the 
SDP relaxation. 



IX. Conclusions 

This paper deals with the rapprochement between the stochastic and worst-case settings for sys- 
tem identification. The problem is formulated within the probabilistic setting of information-based 
complexity, and it is focused on the idea of discarding sets of small measure from the set of 
deterministic estimates. The paper establishes rigorous optimality properties of a trade-off curve, 
called violation function, which shows how the radius of information decreases as a function of the 
accuracy. Subsequently, randomized and deterministic algorithms for computing the optimal violation 
function have been presented. Their performance has been successfully tested on a numerical example. 
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